This report explores the relationships between various factors influencing student performance, using exploratory data analysis (EDA) to identify key trends and correlations. The analysis focuses on variables such as study habits, access to resources, parental involvement, and environmental factors, and how they impact final exam scores. Insights gained from the data will inform recommendations aimed at improving academic outcomes for students.
The dataset was sourced from Kaggle under the CC0 1.0 Universal “No Copyright” license. We are free to copy, modify, distribute, and perform the work, even for commercial purposes, all without asking permission. Learn more about this license here.
URL for data in Kaggle: Student Performance Factors Dataset
student_data <- read.csv('../data/StudentPerformanceFactors.csv', header = TRUE)
student_data # Display the dataset
Hours_Studied Attendance Parental_Involvement Access_to_Resources
Min. : 1.00 Min. : 60.00 Length:6607 Length:6607
1st Qu.:16.00 1st Qu.: 70.00 Class :character Class :character
Median :20.00 Median : 80.00 Mode :character Mode :character
Mean :19.98 Mean : 79.98
3rd Qu.:24.00 3rd Qu.: 90.00
Max. :44.00 Max. :100.00
Extracurricular_Activities Sleep_Hours Previous_Scores
Length:6607 Min. : 4.000 Min. : 50.00
Class :character 1st Qu.: 6.000 1st Qu.: 63.00
Mode :character Median : 7.000 Median : 75.00
Mean : 7.029 Mean : 75.07
3rd Qu.: 8.000 3rd Qu.: 88.00
Max. :10.000 Max. :100.00
Motivation_Level Internet_Access Tutoring_Sessions Family_Income
Length:6607 Length:6607 Min. :0.000 Length:6607
Class :character Class :character 1st Qu.:1.000 Class :character
Mode :character Mode :character Median :1.000 Mode :character
Mean :1.494
3rd Qu.:2.000
Max. :8.000
Teacher_Quality School_Type Peer_Influence Physical_Activity
Length:6607 Length:6607 Length:6607 Min. :0.000
Class :character Class :character Class :character 1st Qu.:2.000
Mode :character Mode :character Mode :character Median :3.000
Mean :2.968
3rd Qu.:4.000
Max. :6.000
Learning_Disabilities Parental_Education_Level Distance_from_Home
Length:6607 Length:6607 Length:6607
Class :character Class :character Class :character
Mode :character Mode :character Mode :character
Gender Exam_Score
Length:6607 Min. : 55.00
Class :character 1st Qu.: 65.00
Mode :character Median : 67.00
Mean : 67.24
3rd Qu.: 69.00
Max. :101.00
'data.frame': 6607 obs. of 20 variables:
$ Hours_Studied : int 23 19 24 29 19 19 29 25 17 23 ...
$ Attendance : int 84 64 98 89 92 88 84 78 94 98 ...
$ Parental_Involvement : chr "Low" "Low" "Medium" "Low" ...
$ Access_to_Resources : chr "High" "Medium" "Medium" "Medium" ...
$ Extracurricular_Activities: chr "No" "No" "Yes" "Yes" ...
$ Sleep_Hours : int 7 8 7 8 6 8 7 6 6 8 ...
$ Previous_Scores : int 73 59 91 98 65 89 68 50 80 71 ...
$ Motivation_Level : chr "Low" "Low" "Medium" "Medium" ...
$ Internet_Access : chr "Yes" "Yes" "Yes" "Yes" ...
$ Tutoring_Sessions : int 0 2 2 1 3 3 1 1 0 0 ...
$ Family_Income : chr "Low" "Medium" "Medium" "Medium" ...
$ Teacher_Quality : chr "Medium" "Medium" "Medium" "Medium" ...
$ School_Type : chr "Public" "Public" "Public" "Public" ...
$ Peer_Influence : chr "Positive" "Negative" "Neutral" "Negative" ...
$ Physical_Activity : int 3 4 4 4 4 3 2 2 1 5 ...
$ Learning_Disabilities : chr "No" "No" "No" "No" ...
$ Parental_Education_Level : chr "High School" "College" "Postgraduate" "High School" ...
$ Distance_from_Home : chr "Near" "Moderate" "Near" "Moderate" ...
$ Gender : chr "Male" "Female" "Male" "Male" ...
$ Exam_Score : int 67 61 74 71 70 71 67 66 69 72 ...
[1] 0
We now check whether the dataset contains any missing values and remove them if necessary.
Hours_Studied Attendance
0 0
Parental_Involvement Access_to_Resources
0 0
Extracurricular_Activities Sleep_Hours
0 0
Previous_Scores Motivation_Level
0 0
Internet_Access Tutoring_Sessions
0 0
Family_Income Teacher_Quality
0 0
School_Type Peer_Influence
0 0
Physical_Activity Learning_Disabilities
0 0
Parental_Education_Level Distance_from_Home
0 0
Gender Exam_Score
0 0
We can see that there are no missing values in the dataset.
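The code that produced the per-column counts above is not shown in the report. A minimal sketch of one common way to obtain such counts in base R follows; `toy` is a hypothetical stand-in for `student_data`, and its values are invented for illustration. Note that `is.na()` does not flag empty strings, so a column can report zero missing values while still containing blanks.

```r
# Hypothetical toy data standing in for student_data (values are invented)
toy <- data.frame(
  Hours_Studied = c(23, 19, NA, 29),
  Teacher_Quality = c("Medium", "", "High", "Low"),
  stringsAsFactors = FALSE
)

# Number of NA entries in each column
na_counts <- colSums(is.na(toy))
print(na_counts)

# Total number of NA entries across the whole data frame
total_na <- sum(is.na(toy))
print(total_na)
```

In this toy example the blank `Teacher_Quality` entry is not counted, because an empty string is a valid character value rather than `NA`.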
We now check the unique values in each categorical column to identify any inconsistencies.
# Unique values in each categorical column
lapply(student_data[, sapply(student_data, is.character)], unique)
$Parental_Involvement
[1] "Low" "Medium" "High"
$Access_to_Resources
[1] "High" "Medium" "Low"
$Extracurricular_Activities
[1] "No" "Yes"
$Motivation_Level
[1] "Low" "Medium" "High"
$Internet_Access
[1] "Yes" "No"
$Family_Income
[1] "Low" "Medium" "High"
$Teacher_Quality
[1] "Medium" "High" "Low" ""
$School_Type
[1] "Public" "Private"
$Peer_Influence
[1] "Positive" "Negative" "Neutral"
$Learning_Disabilities
[1] "No" "Yes"
$Parental_Education_Level
[1] "High School" "College" "Postgraduate" ""
$Distance_from_Home
[1] "Near" "Moderate" "Far" ""
$Gender
[1] "Male" "Female"
From the results above, we can see that Teacher_Quality, Parental_Education_Level, and Distance_from_Home contain empty strings (""), which represent missing values. We will now investigate further to see how many rows are affected in each column.
# Check for missing values in Teacher_Quality
teacher_quality_missing <- student_data[student_data$Teacher_Quality == "",]
nrow(teacher_quality_missing)
[1] 78
We can see that only 78 rows have missing values in the Teacher_Quality column. We will now investigate the Parental_Education_Level column.
# Check for missing values in Parental_Education_Level
parental_education_level_missing <- student_data[student_data$Parental_Education_Level == "",]
nrow(parental_education_level_missing)
[1] 90
We can see that 90 rows have missing values in the Parental_Education_Level column. We will now investigate the Distance_from_Home column.
# Check for missing values in Distance_from_Home
distance_from_home_missing <- student_data[student_data$Distance_from_Home == "",]
nrow(distance_from_home_missing)
[1] 67
We can see that 67 rows have missing values in the Distance_from_Home column. Together, the three columns account for at most 235 affected rows (78 + 90 + 67), roughly 3.6% of the 6,607 observations and well under 10% of the dataset. We will remove these rows from the dataset.
# Remove rows with missing values
student_data <- subset(student_data, Teacher_Quality != "" & Parental_Education_Level != "" & Distance_from_Home != "")
# Check for missing values
lapply(student_data[, sapply(student_data, is.character)], unique)
$Parental_Involvement
[1] "Low" "Medium" "High"
$Access_to_Resources
[1] "High" "Medium" "Low"
$Extracurricular_Activities
[1] "No" "Yes"
$Motivation_Level
[1] "Low" "Medium" "High"
$Internet_Access
[1] "Yes" "No"
$Family_Income
[1] "Low" "Medium" "High"
$Teacher_Quality
[1] "Medium" "High" "Low"
$School_Type
[1] "Public" "Private"
$Peer_Influence
[1] "Positive" "Negative" "Neutral"
$Learning_Disabilities
[1] "No" "Yes"
$Parental_Education_Level
[1] "High School" "College" "Postgraduate"
$Distance_from_Home
[1] "Near" "Moderate" "Far"
$Gender
[1] "Male" "Female"
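An alternative to comparing each column against "" is to recode blanks as `NA` once and then rely on R's standard missing-value tools. The following is a sketch under stated assumptions, not the report's method: `toy` is a hypothetical stand-in for `student_data` with invented values.

```r
# Hypothetical toy data standing in for student_data (values are invented)
toy <- data.frame(
  Teacher_Quality = c("Medium", "", "High", "Low"),
  Distance_from_Home = c("Near", "Far", "", "Near"),
  Exam_Score = c(67L, 61L, 74L, 71L),
  stringsAsFactors = FALSE
)

# Recode empty strings as NA in the character columns only
is_chr <- sapply(toy, is.character)
toy[is_chr] <- lapply(toy[is_chr], function(x) replace(x, x %in% "", NA))

# Count missing values per column in a single pass
missing_counts <- colSums(is.na(toy))
print(missing_counts)

# Keep only the rows with no missing value in any column
toy_clean <- toy[complete.cases(toy), ]
print(nrow(toy_clean))
```

The same recoding can usually be done at import time, since `read.csv()` accepts an `na.strings` argument (for example `na.strings = c("", "NA")`).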
We now investigate the dependent variable, Exam_Score, to identify any outliers.
# Outliers in Final_Exam_Score
outliers_in_exam_score <- student_data[student_data$Exam_Score > 100,]
nrow(outliers_in_exam_score)
[1] 1
We can see that only one student has an exam score of 101, which exceeds 100 and is therefore treated as an outlier. We will remove this row from the dataset.
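The removal step itself is not shown in the report. A minimal sketch of how such a row could be dropped, assuming scores above 100 are invalid; `toy` is a hypothetical stand-in for `student_data`:

```r
# Hypothetical toy data standing in for student_data (values are invented)
toy <- data.frame(Exam_Score = c(67L, 61L, 101L, 71L))

# Keep only rows whose score does not exceed 100 (assumed upper bound);
# drop = FALSE keeps the result a data frame even with a single column
toy <- toy[toy$Exam_Score <= 100, , drop = FALSE]

print(nrow(toy))
print(max(toy$Exam_Score))
```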
Now that the data has been cleaned, we can proceed with the exploratory data analysis.
Here we explore the distribution of final exam scores among students without considering other factors. To examine this distribution, we first sample the data and plot a histogram.
# Sample the data
set.seed(123)
exam_score_sample <- student_data$Exam_Score[sample(nrow(student_data), 100)]
exam_score_sample
 [1] 66 70 72 68 62 69 72 66 58 69 63 67 70 64 65 70 75 70 69 69 68 63 75 69 68
[26] 66 68 70 69 64 60 67 66 70 69 65 65 69 69 66 64 64 66 66 66 72 61 71 66 65
[51] 63 69 70 73 66 70 64 68 71 69 63 68 63 65 70 66 71 71 87 72 67 66 71 64 67
[76] 63 72 64 68 66 75 70 64 67 65 66 63 69 68 65 68 65 61 71 69 68 66 61 59 65
# Plot histogram
hist(exam_score_sample, main = "Distribution of Final Exam Scores", xlab = "Final Exam Score", col = "skyblue", border = "black")
From the histogram, we can see that the distribution of final exam scores is approximately normal. We now plot a boxplot to visualize the spread of scores and identify any outliers.
# Boxplot of final exam scores
boxplot(exam_score_sample, main = "Boxplot of Final Exam Scores", col = "skyblue", border = "black")
The boxplot shows that the distribution of final exam scores is centered around the median, with a few outliers on the lower end of the scale. Now we use numerical methods to confirm the normality of the distribution.
Shapiro-Wilk normality test
data: exam_score_sample
W = 0.92839, p-value = 4.035e-05
The Shapiro-Wilk test rejects normality for the sampled final exam scores, with a p-value less than 0.05, so the distribution is not normal despite the roughly bell-shaped histogram.
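The call behind the test output above is not shown. A minimal sketch of how a Shapiro-Wilk test on a sample of 100 scores can be run in base R follows. The scores here are simulated and the seed is an assumption for reproducibility, so the printed statistics will not match the report. Sampling is a practical necessity as well as a design choice: `shapiro.test()` only accepts between 3 and 5000 observations, and the dataset has more than 6,000 rows.

```r
set.seed(123)  # assumed seed, used only to make this sketch reproducible

# Simulated scores standing in for student_data$Exam_Score
scores <- round(rnorm(6000, mean = 67, sd = 4))

# Draw 100 scores without replacement and test them for normality
score_sample <- sample(scores, 100)
result <- shapiro.test(score_sample)
print(result)

# Reject the null hypothesis of normality at the 5% level
# when the p-value falls below 0.05
reject_normality <- result$p.value < 0.05
print(reject_normality)
```

A small p-value is evidence against normality. A large p-value only means the test failed to detect a departure from normality in that sample.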
Next, we explore the distribution of final exam scores based on parental involvement levels. We will create histograms to compare the scores of students with different levels of parental involvement.
# Sample the data
high_parental_involvement <- student_data$Exam_Score[student_data$Parental_Involvement == "High"][sample(sum(student_data$Parental_Involvement == "High"), 100)]
medium_parental_involvement <- student_data$Exam_Score[student_data$Parental_Involvement == "Medium"][sample(sum(student_data$Parental_Involvement == "Medium"), 100)]
low_parental_involvement <- student_data$Exam_Score[student_data$Parental_Involvement == "Low"][sample(sum(student_data$Parental_Involvement == "Low"), 100)]
# Histogram of final exam scores by parental involvement
par(mfrow = c(1, 3))
hist(high_parental_involvement, main = "High Parental Involvement", xlab = "Final Exam Score", col = "skyblue", border = "black")
hist(medium_parental_involvement, main = "Medium Parental Involvement", xlab = "Final Exam Score", col = "skyblue", border = "black")
hist(low_parental_involvement, main = "Low Parental Involvement", xlab = "Final Exam Score", col = "skyblue", border = "black")
The three histograms show the distribution of final exam scores for students with high, medium, and low levels of parental involvement. We can see that the distribution of the scores seems to be similar across all three categories. They seem to follow a normal distribution, with a slight skew towards higher scores for students with high parental involvement. We now use numerical methods to confirm the normality of the distributions.
Shapiro-Wilk normality test
data: high_parental_involvement
W = 0.97588, p-value = 0.06325
Shapiro-Wilk normality test
data: medium_parental_involvement
W = 0.97777, p-value = 0.08891
Shapiro-Wilk normality test
data: low_parental_involvement
W = 0.98902, p-value = 0.5864
The Shapiro-Wilk test does not reject normality for any of the three parental involvement levels, with all p-values greater than 0.05, which is consistent with approximately normal distributions.
Next, we explore the distribution of final exam scores based on access to resources. We will create histograms and density plots to compare the scores of students with different levels of access to resources.
# Sample the data
high_access_to_resources <- student_data$Exam_Score[student_data$Access_to_Resources == "High"][sample(sum(student_data$Access_to_Resources == "High"), 100)]
medium_access_to_resources <- student_data$Exam_Score[student_data$Access_to_Resources == "Medium"][sample(sum(student_data$Access_to_Resources == "Medium"), 100)]
low_access_to_resources <- student_data$Exam_Score[student_data$Access_to_Resources == "Low"][sample(sum(student_data$Access_to_Resources == "Low"), 100)]
# Histogram of final exam scores by access to resources
par(mfrow = c(1, 3))
hist(high_access_to_resources, main = "High Access to Resources", xlab = "Final Exam Score", col = "skyblue", border = "black")
hist(medium_access_to_resources, main = "Medium Access to Resources", xlab = "Final Exam Score", col = "skyblue", border = "black")
hist(low_access_to_resources, main = "Low Access to Resources", xlab = "Final Exam Score", col = "skyblue", border = "black")
plot(density(low_access_to_resources), main = "Low Access to Resources", xlab = "Final Exam Score", col = "skyblue")
plot(density(medium_access_to_resources), main = "Medium Access to Resources", xlab = "Final Exam Score", col = "skyblue")
plot(density(high_access_to_resources), main = "High Access to Resources", xlab = "Final Exam Score", col = "skyblue")
The histograms and density plots show the distribution of final exam scores for students with high, medium, and low levels of access to resources. The distributions seem to be similar across all three categories, with a slight skew towards higher scores for students with high access to resources. We will now use numerical methods to confirm the normality of the distributions.
Shapiro-Wilk normality test
data: high_access_to_resources
W = 0.97748, p-value = 0.0844
Shapiro-Wilk normality test
data: medium_access_to_resources
W = 0.97329, p-value = 0.03968
Shapiro-Wilk normality test
data: low_access_to_resources
W = 0.97248, p-value = 0.0343
The Shapiro-Wilk test does not reject normality for students with high access to resources (p = 0.084). For students with medium and low access, the p-values are below 0.05, so normality is rejected and these distributions appear slightly skewed.
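The sample-then-test pattern is repeated for every grouping variable in this report. One way to express it once is a small helper that splits the scores by group, samples from each group, and returns the Shapiro-Wilk p-values. This is a sketch only: `shapiro_by_group` is a hypothetical helper that does not appear in the report, and `toy` uses simulated data.

```r
set.seed(123)  # assumed seed, used only to make this sketch reproducible

# Simulated data standing in for student_data
toy <- data.frame(
  Exam_Score = round(rnorm(900, mean = 67, sd = 4)),
  Access_to_Resources = rep(c("High", "Medium", "Low"), each = 300)
)

# Hypothetical helper: sample up to n values from each group and
# return one Shapiro-Wilk p-value per group
shapiro_by_group <- function(values, groups, n = 100) {
  sapply(split(values, groups), function(v) {
    shapiro.test(sample(v, min(n, length(v))))$p.value
  })
}

p_values <- shapiro_by_group(toy$Exam_Score, toy$Access_to_Resources)
print(round(p_values, 4))
```

Because each call draws a fresh random sample, the p-values change from run to run unless a seed is set beforehand.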
Next, we explore the distribution of final exam scores based on participation in extracurricular activities. We will create a boxplot to compare the scores of students who participate in extracurricular activities and those who do not.
# Sample the data
participate_extracurricular <- student_data$Exam_Score[student_data$Extracurricular_Activities == "Yes"][sample(sum(student_data$Extracurricular_Activities == "Yes"), 100)]
do_not_participate_extracurricular <- student_data$Exam_Score[student_data$Extracurricular_Activities == "No"][sample(sum(student_data$Extracurricular_Activities == "No"), 100)]
# Boxplot of final exam scores by extracurricular activities
boxplot(student_data$Exam_Score ~ student_data$Extracurricular_Activities, main = "Final Exam Scores by Extracurricular Activities", xlab = "Extracurricular Activities", ylab = "Final Exam Score", col = "skyblue", border = "black")
The boxplot shows that students who participate in extracurricular activities tend to have higher final exam scores compared to those who do not. Now we will visualize the distribution of scores for both groups using histograms.
# Histogram of final exam scores by extracurricular activities
par(mfrow = c(1, 2))
hist(participate_extracurricular, main = "Extracurricular Activities", xlab = "Final Exam Score", col = "skyblue", border = "black")
hist(do_not_participate_extracurricular, main = "No Extracurricular Activities", xlab = "Final Exam Score", col = "skyblue", border = "black")
Both histograms show normal distributions of final exam scores for students who participate in extracurricular activities and those who do not. We will now use numerical methods to confirm the normality of the distributions.
Shapiro-Wilk normality test
data: participate_extracurricular
W = 0.97923, p-value = 0.1158
Shapiro-Wilk normality test
data: do_not_participate_extracurricular
W = 0.98142, p-value = 0.1712
The Shapiro-Wilk test confirms that both distributions of final exam scores for students who participate in extracurricular activities and those who do not are approximately normal, with p-values greater than 0.05.
Next we explore the distribution of final exam scores based on motivation levels. We will create a histogram to compare the scores of students with different motivation levels.
# Sample the data
high_motivation <- student_data$Exam_Score[student_data$Motivation_Level == "High"][sample(sum(student_data$Motivation_Level == "High"), 100)]
medium_motivation <- student_data$Exam_Score[student_data$Motivation_Level == "Medium"][sample(sum(student_data$Motivation_Level == "Medium"), 100)]
low_motivation <- student_data$Exam_Score[student_data$Motivation_Level == "Low"][sample(sum(student_data$Motivation_Level == "Low"), 100)]
# Histogram of final exam scores by motivation level
par(mfrow = c(1, 3))
hist(high_motivation, main = "High Motivation Level", xlab = "Final Exam Score", col = "skyblue", border = "black")
hist(medium_motivation, main = "Medium Motivation Level", xlab = "Final Exam Score", col = "skyblue", border = "black")
hist(low_motivation, main = "Low Motivation Level", xlab = "Final Exam Score", col = "skyblue", border = "black")
Higher motivation levels correlate with more consistent and concentrated exam scores around a central range. Lower motivation levels are associated with greater variability and more scores clustering in the lower range. Motivation level seems to have a strong influence on exam performance, with higher motivation linked to better and more uniform outcomes.
Shapiro-Wilk normality test
data: high_motivation
W = 0.94412, p-value = 0.0003468
Shapiro-Wilk normality test
data: medium_motivation
W = 0.94732, p-value = 0.07268
Shapiro-Wilk normality test
data: low_motivation
W = 0.83108, p-value = 5.376e-07
The Shapiro-Wilk test does not reject normality for students with medium motivation (p = 0.073). However, it rejects normality for the high and low motivation groups, with p-values below 0.05, suggesting that those distributions are skewed.
Next, we explore the distribution of final exam scores based on internet access. We will create histograms to compare the scores of students with and without internet access.
# Sample the data
internet_access <- student_data$Exam_Score[student_data$Internet_Access == "Yes"][sample(sum(student_data$Internet_Access == "Yes"), 100)]
no_internet_access <- student_data$Exam_Score[student_data$Internet_Access == "No"][sample(sum(student_data$Internet_Access == "No"), 100)]
# Histogram of final exam scores by internet access
par(mfrow = c(1, 2))
hist(internet_access, main = "Internet Access", xlab = "Final Exam Score", col = "skyblue", border = "black")
hist(no_internet_access, main = "No Internet Access", xlab = "Final Exam Score", col = "skyblue", border = "black")
Shapiro-Wilk normality test
data: internet_access
W = 0.96856, p-value = 0.01718
Shapiro-Wilk normality test
data: no_internet_access
W = 0.93958, p-value = 0.0001819
The Shapiro-Wilk test rejects normality for students both with and without internet access, with p-values less than 0.05 in each group.
In summary, the analysis of the distribution of final exam scores and study hours across various categories such as parental involvement, access to resources, extracurricular activities, motivation levels, internet access, family income, teacher quality, school type, peer influence, learning disabilities, parental education levels, distance from home to school, and gender shows no significant differences. The distributions are approximately normal in most cases, with a few exceptions where the distributions are slightly skewed. Overall, these factors do not have a significant impact on the final exam scores or the number of hours students study per week.
Next, we explore the average number of hours students study per week without considering other factors.
Min. 1st Qu. Median Mean 3rd Qu. Max.
1.00 16.00 20.00 19.98 24.00 44.00
The summary statistics show that the average number of hours students study per week is approximately 19.98 hours. We will now visualize the distribution of study hours using a histogram.
# Histogram of study hours
hist(student_data$Hours_Studied, main = "Distribution of Study Hours", xlab = "Study Hours", col = "skyblue", border = "black")
The distribution of study hours is approximately normal. We will now use numerical methods to confirm the normality of the distribution.
# Shapiro-Wilk test for normality
sample_hours_studied <- student_data$Hours_Studied[sample(nrow(student_data), 100)]
shapiro.test(sample_hours_studied)
Shapiro-Wilk normality test
data: sample_hours_studied
W = 0.99057, p-value = 0.7103
The p-value is greater than 0.05, indicating that the distribution of study hours is approximately normal.
Next, we will explore the average number of hours students study per week based on parental involvement levels.
# Summary statistics for study hours by parental involvement
summary(student_data$Hours_Studied[student_data$Parental_Involvement == "High"])
Min. 1st Qu. Median Mean 3rd Qu. Max.
1.00 16.00 20.00 19.86 24.00 44.00
Min. 1st Qu. Median Mean 3rd Qu. Max.
1.00 16.00 20.00 19.99 24.00 39.00
Min. 1st Qu. Median Mean 3rd Qu. Max.
2.00 16.00 20.00 20.11 24.00 38.00
The summary statistics of the average number of hours students study per week based on parental involvement levels do not show significant differences. We will now use ANOVA to test for differences in study hours based on parental involvement levels. But first, we need to check the assumptions of ANOVA.
# Sample the data
high_parental_involvement <- student_data$Hours_Studied[student_data$Parental_Involvement == "High"][sample(sum(student_data$Parental_Involvement == "High"), 100)]
medium_parental_involvement <- student_data$Hours_Studied[student_data$Parental_Involvement == "Medium"][sample(sum(student_data$Parental_Involvement == "Medium"), 100)]
low_parental_involvement <- student_data$Hours_Studied[student_data$Parental_Involvement == "Low"][sample(sum(student_data$Parental_Involvement == "Low"), 100)]
# Histogram of study hours by parental involvement
par(mfrow = c(1, 3))
hist(high_parental_involvement, main = "High Parental Involvement", xlab = "Study Hours", col = "skyblue", border = "black")
hist(medium_parental_involvement, main = "Medium Parental Involvement", xlab = "Study Hours", col = "skyblue", border = "black")
hist(low_parental_involvement, main = "Low Parental Involvement", xlab = "Study Hours", col = "skyblue", border = "black")
Shapiro-Wilk normality test
data: high_parental_involvement
W = 0.9827, p-value = 0.2145
Shapiro-Wilk normality test
data: medium_parental_involvement
W = 0.99085, p-value = 0.7331
Shapiro-Wilk normality test
data: low_parental_involvement
W = 0.986, p-value = 0.3737
Both the histograms and the Shapiro-Wilk test show that the distribution of study hours is approximately normal for all students. We will now use ANOVA to test for differences in study hours based on parental involvement levels.
# ANOVA test for study hours by parental involvement
anova_parental_involvement <- aov(Hours_Studied ~ Parental_Involvement, data = student_data)
summary(anova_parental_involvement)
                       Df Sum Sq Mean Sq F value Pr(>F)
Parental_Involvement    2     49   24.44   0.682  0.506
Residuals            6374 228362   35.83
The ANOVA test shows that there is no significant difference in the average number of hours students study per week based on parental involvement levels, with a p-value greater than 0.05. This indicates that parental involvement does not have a significant impact on the number of hours students study per week.
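Besides normality within groups, one-way ANOVA also assumes roughly equal variances across groups, which the report does not check explicitly. A hedged sketch of one way to test that assumption with base R's `bartlett.test()` follows. The data are simulated, the seed is an assumption, and Bartlett's test is itself sensitive to departures from normality, so this is an illustration rather than the report's procedure.

```r
set.seed(123)  # assumed seed, used only to make this sketch reproducible

# Simulated data standing in for student_data
toy <- data.frame(
  Hours_Studied = rnorm(600, mean = 20, sd = 6),
  Parental_Involvement = rep(c("High", "Medium", "Low"), each = 200)
)

# Null hypothesis: the variance of Hours_Studied is the same in every group
bartlett_result <- bartlett.test(Hours_Studied ~ Parental_Involvement, data = toy)
print(bartlett_result)
```

A p-value above 0.05 here would give no evidence against equal variances. A small p-value would suggest using a method that does not assume them, such as `oneway.test()` with its default `var.equal = FALSE`.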
We now explore the average number of hours students study per week based on access to resources.
Min. 1st Qu. Median Mean 3rd Qu. Max.
1.00 16.00 20.00 20.02 24.00 39.00
Min. 1st Qu. Median Mean 3rd Qu. Max.
1.0 16.0 20.0 19.9 24.0 43.0
Min. 1st Qu. Median Mean 3rd Qu. Max.
2.00 16.00 20.00 20.12 24.00 44.00
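The calls that produced the three summaries above are not shown. A minimal sketch of how per-group summaries can be computed in one call with `tapply()` follows; `toy` is a hypothetical stand-in for `student_data` with invented values.

```r
# Hypothetical toy data standing in for student_data (values are invented)
toy <- data.frame(
  Hours_Studied = c(10, 20, 30, 15, 25, 22),
  Access_to_Resources = c("High", "High", "High", "Low", "Low", "Medium")
)

# Six-number summary of study hours for each level of access to resources
group_summaries <- tapply(toy$Hours_Studied, toy$Access_to_Resources, summary)
print(group_summaries)

# Group means only, as a named numeric vector
group_means <- tapply(toy$Hours_Studied, toy$Access_to_Resources, mean)
print(group_means)
```

Groups are returned in alphabetical order of the level names (High, Low, Medium) unless the column is converted to a factor with an explicit level order.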
The summary statistics of the average number of hours students study per week based on access to resources do not show significant differences. We will now use ANOVA to test for differences in study hours based on access to resources. But first, we need to check the assumptions of ANOVA.
# Sample the data
high_access_to_resources <- student_data$Hours_Studied[student_data$Access_to_Resources == "High"][sample(sum(student_data$Access_to_Resources == "High"), 100)]
medium_access_to_resources <- student_data$Hours_Studied[student_data$Access_to_Resources == "Medium"][sample(sum(student_data$Access_to_Resources == "Medium"), 100)]
low_access_to_resources <- student_data$Hours_Studied[student_data$Access_to_Resources == "Low"][sample(sum(student_data$Access_to_Resources == "Low"), 100)]
# Histogram of study hours by access to resources
par(mfrow = c(1, 3))
hist(high_access_to_resources, main = "High Access to Resources", xlab = "Study Hours", col = "skyblue", border = "black")
hist(medium_access_to_resources, main = "Medium Access to Resources", xlab = "Study Hours", col = "skyblue", border = "black")
hist(low_access_to_resources, main = "Low Access to Resources", xlab = "Study Hours", col = "skyblue", border = "black")
Shapiro-Wilk normality test
data: high_access_to_resources
W = 0.98025, p-value = 0.139
Shapiro-Wilk normality test
data: medium_access_to_resources
W = 0.96582, p-value = 0.01068
Shapiro-Wilk normality test
data: low_access_to_resources
W = 0.97123, p-value = 0.02747
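The Shapiro-Wilk test does not reject normality for the high-access group (p = 0.139), but it does reject normality at the 0.05 level for the medium (p = 0.011) and low (p = 0.027) groups, so the normality assumption is only partly supported. We nevertheless use ANOVA to test for differences in study hours based on access to resources, and the result should be read with that caveat in mind.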
# ANOVA test for study hours by access to resources
anova_access_to_resources <- aov(Hours_Studied ~ Access_to_Resources, data = student_data)
summary(anova_access_to_resources)
                      Df Sum Sq Mean Sq F value Pr(>F)
Access_to_Resources    2     48   24.12   0.673   0.51
Residuals           6374 228363   35.83
The ANOVA test shows that there is no significant difference in the average number of hours students study per week based on access to resources, with a p-value greater than 0.05. This indicates that access to resources does not have a significant impact on the number of hours students study per week.
Next, we explore the average number of hours students study per week based on participation in extracurricular activities.
# Summary statistics for study hours by extracurricular activities
summary(student_data$Hours_Studied[student_data$Extracurricular_Activities == "Yes"])
Min. 1st Qu. Median Mean 3rd Qu. Max.
1.00 16.00 20.00 19.93 24.00 43.00
Min. 1st Qu. Median Mean 3rd Qu. Max.
2.00 16.00 20.00 20.04 24.00 44.00
The summary statistics of the average number of hours students study per week based on participation in extracurricular activities show no notable differences. We will now use a t-test to test for differences in study hours based on participation in extracurricular activities. But first, we need to check the assumptions of the t-test.
# Sample the data
participate_extracurricular <- student_data$Hours_Studied[student_data$Extracurricular_Activities == "Yes"][sample(sum(student_data$Extracurricular_Activities == "Yes"), 100)]
do_not_participate_extracurricular <- student_data$Hours_Studied[student_data$Extracurricular_Activities == "No"][sample(sum(student_data$Extracurricular_Activities == "No"), 100)]
# Histogram of study hours by extracurricular activities
par(mfrow = c(1, 2))
hist(participate_extracurricular, main = "Extracurricular Activities", xlab = "Study Hours", col = "skyblue", border = "black")
hist(do_not_participate_extracurricular, main = "No Extracurricular Activities", xlab = "Study Hours", col = "skyblue", border = "black")
Shapiro-Wilk normality test
data: participate_extracurricular
W = 0.98586, p-value = 0.3655
Shapiro-Wilk normality test
data: do_not_participate_extracurricular
W = 0.98909, p-value = 0.5916
Both the histograms and the Shapiro-Wilk test show that the distribution of study hours is approximately normal for both groups. We will now use a t-test to test for differences in study hours based on participation in extracurricular activities.
# t-test for study hours by extracurricular activities
t.test(participate_extracurricular, do_not_participate_extracurricular)
Welch Two Sample t-test
data: participate_extracurricular and do_not_participate_extracurricular
t = -0.99317, df = 197.84, p-value = 0.3218
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-2.5079006 0.8279006
sample estimates:
mean of x mean of y
20.29 21.13
The t-test shows that there is no significant difference in the average number of hours students study per week based on participation in extracurricular activities, with a p-value greater than 0.05. This indicates that participation in extracurricular activities does not have a significant impact on the number of hours students study per week.
Next, we explore the average number of hours students study per week based on motivation levels.
# Summary statistics for study hours by motivation level
summary(student_data$Hours_Studied[student_data$Motivation_Level == "High"])
Min. 1st Qu. Median Mean 3rd Qu. Max.
1.00 16.00 20.00 19.75 24.00 39.00
Min. 1st Qu. Median Mean 3rd Qu. Max.
1.00 16.00 20.00 20.06 24.00 43.00
Min. 1st Qu. Median Mean 3rd Qu. Max.
2.00 16.00 20.00 19.98 24.00 44.00
The summary statistics of the average number of hours students study per week based on motivation levels show no significant differences. We will now use ANOVA to test for differences in study hours based on motivation levels. But first, we need to check the assumptions of ANOVA.
# Sample the data
high_motivation <- student_data$Hours_Studied[student_data$Motivation_Level == "High"][sample(sum(student_data$Motivation_Level == "High"), 100)]
medium_motivation <- student_data$Hours_Studied[student_data$Motivation_Level == "Medium"][sample(sum(student_data$Motivation_Level == "Medium"), 100)]
low_motivation <- student_data$Hours_Studied[student_data$Motivation_Level == "Low"][sample(sum(student_data$Motivation_Level == "Low"), 100)]
# Histogram of study hours by motivation level
par(mfrow = c(1, 3))
hist(high_motivation, main = "High Motivation Level", xlab = "Study Hours", col = "skyblue", border = "black")
hist(medium_motivation, main = "Medium Motivation Level", xlab = "Study Hours", col = "skyblue", border = "black")
hist(low_motivation, main = "Low Motivation Level", xlab = "Study Hours", col = "skyblue", border = "black")
Shapiro-Wilk normality test
data: high_motivation
W = 0.98676, p-value = 0.4217
Shapiro-Wilk normality test
data: medium_motivation
W = 0.9649, p-value = 0.009127
Shapiro-Wilk normality test
data: low_motivation
W = 0.99065, p-value = 0.717
The Shapiro-Wilk test shows that the distribution of study hours for students with medium motivation is not normal, with a p-value less than 0.05. Because the normality assumption of ANOVA is not met, we use the non-parametric Kruskal-Wallis test to test for differences in study hours based on motivation levels.
# Kruskal-Wallis test for study hours by motivation level
kruskal.test(Hours_Studied ~ Motivation_Level, data = student_data)
Kruskal-Wallis rank sum test
data: Hours_Studied by Motivation_Level
Kruskal-Wallis chi-squared = 2.9587, df = 2, p-value = 0.2278
The Kruskal-Wallis test shows no significant difference in the distribution of weekly study hours across motivation levels, with a p-value greater than 0.05. This indicates that motivation level does not have a significant effect on the number of hours students study per week.
Next, we explore the average number of hours students study per week based on internet access.
# Summary statistics for study hours by internet access
summary(student_data$Hours_Studied[student_data$Internet_Access == "Yes"])
Min. 1st Qu. Median Mean 3rd Qu. Max.
1.00 16.00 20.00 19.99 24.00 44.00
Min. 1st Qu. Median Mean 3rd Qu. Max.
4.00 15.00 20.00 19.83 24.00 37.00
The summary statistics of the average number of hours students study per week based on internet access show no notable differences. We will now use a t-test to test for differences in study hours based on internet access. But first, we need to check the assumptions of the t-test.
# Sample the data
internet_access <- student_data$Hours_Studied[student_data$Internet_Access == "Yes"][sample(sum(student_data$Internet_Access == "Yes"), 100)]
no_internet_access <- student_data$Hours_Studied[student_data$Internet_Access == "No"][sample(sum(student_data$Internet_Access == "No"), 100)]
# Histogram of study hours by internet access
par(mfrow = c(1, 2))
hist(internet_access, main = "Internet Access", xlab = "Study Hours", col = "skyblue", border = "black")
hist(no_internet_access, main = "No Internet Access", xlab = "Study Hours", col = "skyblue", border = "black")
Shapiro-Wilk normality test
data: internet_access
W = 0.99238, p-value = 0.8479
Shapiro-Wilk normality test
data: no_internet_access
W = 0.99029, p-value = 0.6878
Both the histograms and the Shapiro-Wilk test show that the distribution of study hours is approximately normal for all students. We will now use t-test to test for differences in study hours based on internet access.
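The subsamples above are drawn without a fixed seed, so the Shapiro-Wilk p-values will change on every run. (Subsampling itself is reasonable here, since `shapiro.test` accepts at most 5000 observations.) The following is a minimal sketch of a reproducible version of the check, assuming synthetic data; `demo_data` and `check_normality` are hypothetical names and are not part of the original analysis.

```r
# Minimal sketch on synthetic data; `demo_data` and `check_normality` are
# hypothetical names used only for illustration.
set.seed(42)  # fix the RNG so the same subsample is drawn on every run

demo_data <- data.frame(
  Hours_Studied   = round(rnorm(600, mean = 20, sd = 6)),
  Internet_Access = rep(c("Yes", "No"), times = c(500, 100))
)

# Draw up to `n` observations from each group and run Shapiro-Wilk on each draw
check_normality <- function(values, groups, n = 100) {
  sapply(split(values, groups), function(v) {
    v <- v[sample(length(v), min(n, length(v)))]
    shapiro.test(v)$p.value
  })
}

normality_p <- check_normality(demo_data$Hours_Studied, demo_data$Internet_Access)
print(normality_p)
```

Wrapping the sampling and the test in one function keeps the group sizes and seed handling identical for every factor that is checked.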
Welch Two Sample t-test
data: internet_access and no_internet_access
t = -2.2355, df = 197.79, p-value = 0.0265
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-3.7078465 -0.2321535
sample estimates:
mean of x mean of y
18.90 20.87
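The p-value is less than 0.05, indicating a statistically significant difference in mean study hours between the two 100-student samples (18.90 hours with internet access versus 20.87 hours without). However, the test was run on random subsamples, and the full-data means reported above (19.99 and 19.83 hours) are nearly identical and point in the opposite direction. The result should therefore be interpreted with caution, as it may reflect sampling variation rather than a real effect of internet access.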
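One way to see how much a subsample-based t-test can vary is to compare it with a test on all rows. The sketch below uses synthetic data in which both groups share the same true mean; `demo_data`, `full_test`, and `subsample_p` are hypothetical names, and the group sizes are assumptions chosen only for illustration.

```r
# Minimal sketch on synthetic data where the two groups have the same true mean.
set.seed(1)
demo_data <- data.frame(
  Hours_Studied   = rnorm(1000, mean = 20, sd = 6),
  Internet_Access = rep(c("Yes", "No"), times = c(900, 100))
)

# Welch t-test on all rows; the formula interface splits by the grouping column
full_test <- t.test(Hours_Studied ~ Internet_Access, data = demo_data)

# Repeat the 100-per-group subsample test many times to see how much p varies
subsample_p <- replicate(200, {
  yes <- sample(demo_data$Hours_Studied[demo_data$Internet_Access == "Yes"], 100)
  no  <- sample(demo_data$Hours_Studied[demo_data$Internet_Access == "No"], 100)
  t.test(yes, no)$p.value
})

full_test$p.value
mean(subsample_p < 0.05)  # share of subsamples that would be called significant
```

Because the Welch t-test places no upper limit on sample size, running it on the full data avoids depending on a single random draw.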
Next, we explore the average number of hours students study per week based on family income levels.
# Summary statistics for study hours by family income
summary(student_data$Hours_Studied[student_data$Family_Income == "Low"])
Min. 1st Qu. Median Mean 3rd Qu. Max.
1.00 16.00 20.00 19.93 24.00 39.00
Min. 1st Qu. Median Mean 3rd Qu. Max.
1.00 16.00 20.00 20.07 24.00 44.00
Min. 1st Qu. Median Mean 3rd Qu. Max.
3.00 16.00 20.00 19.89 24.00 39.00
The summary statistics of weekly study hours are very similar across family income levels. We will now use ANOVA to test for differences in study hours based on family income levels. But first, we need to check the assumptions of ANOVA.
# Sample the data
low_family_income <- student_data$Hours_Studied[student_data$Family_Income == "Low"][sample(sum(student_data$Family_Income == "Low"), 100)]
medium_family_income <- student_data$Hours_Studied[student_data$Family_Income == "Medium"][sample(sum(student_data$Family_Income == "Medium"), 100)]
high_family_income <- student_data$Hours_Studied[student_data$Family_Income == "High"][sample(sum(student_data$Family_Income == "High"), 100)]
# Histogram of study hours by family income
par(mfrow = c(1, 3))
hist(low_family_income, main = "Low Family Income", xlab = "Study Hours", col = "skyblue", border = "black")
hist(medium_family_income, main = "Medium Family Income", xlab = "Study Hours", col = "skyblue", border = "black")
hist(high_family_income, main = "High Family Income", xlab = "Study Hours", col = "skyblue", border = "black")
Shapiro-Wilk normality test
data: low_family_income
W = 0.99118, p-value = 0.7595
Shapiro-Wilk normality test
data: medium_family_income
W = 0.98624, p-value = 0.3883
Shapiro-Wilk normality test
data: high_family_income
W = 0.98195, p-value = 0.1881
Both the histograms and the Shapiro-Wilk test show that the distribution of study hours is approximately normal for all students. We will now use ANOVA to test for differences in study hours based on family income levels.
# ANOVA test for study hours by family income
anova_family_income <- aov(Hours_Studied ~ Family_Income, data = student_data)
summary(anova_family_income)
                Df Sum Sq Mean Sq F value Pr(>F)
Family_Income    2     35   17.47   0.487  0.614
Residuals     6374 228376   35.83
The ANOVA test shows that there is no significant difference in the average number of hours students study per week based on family income levels, with a p-value greater than 0.05. This indicates that family income does not have a significant impact on the number of hours students study per week.
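Besides normality, classical ANOVA also assumes roughly equal variances across groups. The sketch below shows one way this could be checked with Bartlett's test, alongside Welch's one-way test, which does not require equal variances. It runs on synthetic data; `demo_data`, `variance_check`, `welch_anova`, and `classic_anova` are hypothetical names, not part of the original analysis.

```r
# Minimal sketch on synthetic data with three equally sized groups.
set.seed(7)
demo_data <- data.frame(
  Hours_Studied = rnorm(300, mean = 20, sd = 6),
  Family_Income = factor(rep(c("Low", "Medium", "High"), each = 100))
)

# Bartlett's test: the null hypothesis is equal variances across groups
variance_check <- bartlett.test(Hours_Studied ~ Family_Income, data = demo_data)

# Welch's one-way test does not assume equal variances (var.equal = FALSE by default)
welch_anova <- oneway.test(Hours_Studied ~ Family_Income, data = demo_data)

# Classical one-way ANOVA for comparison
classic_anova <- summary(aov(Hours_Studied ~ Family_Income, data = demo_data))

variance_check$p.value
welch_anova$p.value
classic_anova[[1]][["Pr(>F)"]][1]
```

If Bartlett's test rejects equal variances, the Welch result is the safer one to report.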
Next, we explore the average number of hours students study per week based on teacher quality levels.
# Summary statistics for study hours by teacher quality
summary(student_data$Hours_Studied[student_data$Teacher_Quality == "Low"])
Min. 1st Qu. Median Mean 3rd Qu. Max.
4.00 16.00 20.00 20.05 24.00 39.00
Min. 1st Qu. Median Mean 3rd Qu. Max.
1.00 16.00 20.00 19.99 24.00 39.00
Min. 1st Qu. Median Mean 3rd Qu. Max.
2.00 16.00 20.00 19.92 24.00 44.00
The summary statistics of weekly study hours are very similar across teacher quality levels. We will now use ANOVA to test for differences in study hours based on teacher quality levels. But first, we need to check the assumptions of ANOVA.
# Sample the data
low_teacher_quality <- student_data$Hours_Studied[student_data$Teacher_Quality == "Low"][sample(sum(student_data$Teacher_Quality == "Low"), 100)]
medium_teacher_quality <- student_data$Hours_Studied[student_data$Teacher_Quality == "Medium"][sample(sum(student_data$Teacher_Quality == "Medium"), 100)]
high_teacher_quality <- student_data$Hours_Studied[student_data$Teacher_Quality == "High"][sample(sum(student_data$Teacher_Quality == "High"), 100)]
# Histogram of study hours by teacher quality
par(mfrow = c(1, 3))
hist(low_teacher_quality, main = "Low Teacher Quality", xlab = "Study Hours", col = "skyblue", border = "black")
hist(medium_teacher_quality, main = "Medium Teacher Quality", xlab = "Study Hours", col = "skyblue", border = "black")
hist(high_teacher_quality, main = "High Teacher Quality", xlab = "Study Hours", col = "skyblue", border = "black")
Shapiro-Wilk normality test
data: low_teacher_quality
W = 0.98158, p-value = 0.1763
Shapiro-Wilk normality test
data: medium_teacher_quality
W = 0.98216, p-value = 0.1952
Shapiro-Wilk normality test
data: high_teacher_quality
W = 0.98864, p-value = 0.5568
All three histograms and the Shapiro-Wilk test show that the distribution of study hours is approximately normal for all students. We will now use ANOVA to test for differences in study hours based on teacher quality levels.
# ANOVA test for study hours by teacher quality
anova_teacher_quality <- aov(Hours_Studied ~ Teacher_Quality, data = student_data)
summary(anova_teacher_quality)
                  Df Sum Sq Mean Sq F value Pr(>F)
Teacher_Quality    2     12    5.88   0.164  0.849
Residuals       6374 228400   35.83
The ANOVA test shows that there is no significant difference in the average number of hours students study per week based on teacher quality levels, with a p-value greater than 0.05. This indicates that teacher quality does not have a significant impact on the number of hours students study per week.
Next, we explore the average number of hours students study per week based on school type.
# Summary statistics for study hours by school type
summary(student_data$Hours_Studied[student_data$School_Type == "Public"])
Min. 1st Qu. Median Mean 3rd Qu. Max.
1.00 16.00 20.00 19.98 24.00 43.00
Min. 1st Qu. Median Mean 3rd Qu. Max.
1.00 16.00 20.00 19.97 24.00 44.00
The summary statistics of weekly study hours are very similar for public and private schools. We will now use a t-test to test for differences in study hours based on school type. But first, we need to check the assumptions of the t-test.
# Sample the data
public_school <- student_data$Hours_Studied[student_data$School_Type == "Public"][sample(sum(student_data$School_Type == "Public"), 100)]
private_school <- student_data$Hours_Studied[student_data$School_Type == "Private"][sample(sum(student_data$School_Type == "Private"), 100)]
# Histogram of study hours by school type
par(mfrow = c(1, 2))
hist(public_school, main = "Public School", xlab = "Study Hours", col = "skyblue", border = "black")
hist(private_school, main = "Private School", xlab = "Study Hours", col = "skyblue", border = "black")
Shapiro-Wilk normality test
data: public_school
W = 0.98459, p-value = 0.2967
Shapiro-Wilk normality test
data: private_school
W = 0.98734, p-value = 0.4609
Both histograms and the Shapiro-Wilk test show that the distribution of study hours is approximately normal for all students. We will now use t-test to test for differences in study hours based on school type.
Welch Two Sample t-test
data: public_school and private_school
t = -0.26017, df = 197.45, p-value = 0.795
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-1.887563 1.447563
sample estimates:
mean of x mean of y
19.33 19.55
The t-test shows that there is no significant difference in the average number of hours students study per week based on school type, with a p-value greater than 0.05. This indicates that school type does not have a significant impact on the number of hours students study per week.
Next, we explore the average number of hours students study per week based on peer influence levels.
# Summary statistics for study hours by peer influence
summary(student_data$Hours_Studied[student_data$Peer_Influence == "Positive"])
Min. 1st Qu. Median Mean 3rd Qu. Max.
1.00 16.00 20.00 20.06 24.00 43.00
Min. 1st Qu. Median Mean 3rd Qu. Max.
1.00 16.00 20.00 19.95 24.00 44.00
Min. 1st Qu. Median Mean 3rd Qu. Max.
2.00 16.00 20.00 19.91 24.00 39.00
The summary statistics of weekly study hours are very similar across peer influence levels. We will now use ANOVA to test for differences in study hours based on peer influence levels. But first, we need to check the assumptions of ANOVA.
# Sample the data
positive_peer_influence <- student_data$Hours_Studied[student_data$Peer_Influence == "Positive"][sample(sum(student_data$Peer_Influence == "Positive"), 100)]
negative_peer_influence <- student_data$Hours_Studied[student_data$Peer_Influence == "Negative"][sample(sum(student_data$Peer_Influence == "Negative"), 100)]
neutral_peer_influence <- student_data$Hours_Studied[student_data$Peer_Influence == "Neutral"][sample(sum(student_data$Peer_Influence == "Neutral"), 100)]
# Histogram of study hours by peer influence
par(mfrow = c(1, 3))
hist(positive_peer_influence, main = "Positive Peer Influence", xlab = "Study Hours", col = "skyblue", border = "black")
hist(negative_peer_influence, main = "Negative Peer Influence", xlab = "Study Hours", col = "skyblue", border = "black")
hist(neutral_peer_influence, main = "Neutral Peer Influence", xlab = "Study Hours", col = "skyblue", border = "black")
Shapiro-Wilk normality test
data: positive_peer_influence
W = 0.98724, p-value = 0.454
Shapiro-Wilk normality test
data: negative_peer_influence
W = 0.98803, p-value = 0.5107
Shapiro-Wilk normality test
data: neutral_peer_influence
W = 0.97778, p-value = 0.08907
All three histograms and the Shapiro-Wilk test show that the distribution of study hours is approximately normal for all students. We will now use ANOVA to test for differences in study hours based on peer influence levels.
# ANOVA test for study hours by peer influence
anova_peer_influence <- aov(Hours_Studied ~ Peer_Influence, data = student_data)
summary(anova_peer_influence)
                 Df Sum Sq Mean Sq F value Pr(>F)
Peer_Influence    2     29   14.33     0.4   0.67
Residuals      6374 228383   35.83
The ANOVA test shows that there is no significant difference in the average number of hours students study per week based on peer influence levels, with a p-value greater than 0.05. This indicates that peer influence does not have a significant impact on the number of hours students study per week.
Next, we explore the average number of hours students study per week based on learning disability.
# Summary statistics for study hours by learning disability
summary(student_data$Hours_Studied[student_data$Learning_Disabilities == "Yes"])
Min. 1st Qu. Median Mean 3rd Qu. Max.
4.00 16.00 20.00 19.73 24.00 35.00
Min. 1st Qu. Median Mean 3rd Qu. Max.
1 16 20 20 24 44
The summary statistics of weekly study hours are very similar for students with and without learning disabilities. We will now use a t-test to test for differences in study hours based on learning disability. But first, we need to check the assumptions of the t-test.
# Sample the data
learning_disabilities <- student_data$Hours_Studied[student_data$Learning_Disabilities == "Yes"][sample(sum(student_data$Learning_Disabilities == "Yes"), 100)]
no_learning_disabilities <- student_data$Hours_Studied[student_data$Learning_Disabilities == "No"][sample(sum(student_data$Learning_Disabilities == "No"), 100)]
# Histogram of study hours by learning disability
par(mfrow = c(1, 2))
hist(learning_disabilities, main = "Learning Disabilities", xlab = "Study Hours", col = "skyblue", border = "black")
hist(no_learning_disabilities, main = "No Learning Disabilities", xlab = "Study Hours", col = "skyblue", border = "black")
Shapiro-Wilk normality test
data: learning_disabilities
W = 0.98009, p-value = 0.135
Shapiro-Wilk normality test
data: no_learning_disabilities
W = 0.97305, p-value = 0.03802
The Shapiro-Wilk test shows that the distribution of study hours is approximately normal for students with learning disabilities, with a p-value greater than 0.05. However, the distribution for students without learning disabilities departs from normality, with a p-value less than 0.05. We will therefore use the Wilcoxon rank-sum test instead of the t-test to test for differences in study hours based on learning disability.
# Wilcoxon rank sum test for study hours by learning disability
wilcox.test(learning_disabilities, no_learning_disabilities)
Wilcoxon rank sum test with continuity correction
data: learning_disabilities and no_learning_disabilities
W = 5135, p-value = 0.742
alternative hypothesis: true location shift is not equal to 0
The Wilcoxon rank-sum test shows that there is no significant difference in the number of hours students study per week based on learning disability, with a p-value greater than 0.05. This indicates that learning disabilities do not have a significant impact on the number of hours students study per week.
Next, we explore the average number of hours students study per week based on parental education levels.
# Summary statistics for study hours by parental education level
summary(student_data$Hours_Studied[student_data$Parental_Education_Level == "High School"])
Min. 1st Qu. Median Mean 3rd Qu. Max.
1.00 16.00 20.00 20.05 24.00 44.00
Min. 1st Qu. Median Mean 3rd Qu. Max.
1.00 16.00 20.00 19.87 24.00 39.00
Min. 1st Qu. Median Mean 3rd Qu. Max.
3.00 16.00 20.00 19.97 24.00 39.00
The summary statistics of weekly study hours are very similar across parental education levels. We will now use ANOVA to test for differences in study hours based on parental education levels. But first, we need to check the assumptions of ANOVA.
# Sample the data
high_school_education <- student_data$Hours_Studied[student_data$Parental_Education_Level == "High School"][sample(sum(student_data$Parental_Education_Level == "High School"), 100)]
college_education <- student_data$Hours_Studied[student_data$Parental_Education_Level == "College"][sample(sum(student_data$Parental_Education_Level == "College"), 100)]
postgraduate_education <- student_data$Hours_Studied[student_data$Parental_Education_Level == "Postgraduate"][sample(sum(student_data$Parental_Education_Level == "Postgraduate"), 100)]
# Histogram of study hours by parental education level
par(mfrow = c(1, 3))
hist(high_school_education, main = "High School Education", xlab = "Study Hours", col = "skyblue", border = "black")
hist(college_education, main = "College Education", xlab = "Study Hours", col = "skyblue", border = "black")
hist(postgraduate_education, main = "Postgraduate Education", xlab = "Study Hours", col = "skyblue", border = "black")
Shapiro-Wilk normality test
data: high_school_education
W = 0.98413, p-value = 0.2746
Shapiro-Wilk normality test
data: college_education
W = 0.97967, p-value = 0.1253
Shapiro-Wilk normality test
data: postgraduate_education
W = 0.98131, p-value = 0.168
All three histograms and the Shapiro-Wilk test show that the distribution of study hours is approximately normal for all students. We will now use ANOVA to test for differences in study hours based on parental education levels.
# ANOVA test for study hours by parental education level
anova_parental_education_level <- aov(Hours_Studied ~ Parental_Education_Level, data = student_data)
summary(anova_parental_education_level)
                           Df Sum Sq Mean Sq F value Pr(>F)
Parental_Education_Level    2     37   18.71   0.522  0.593
Residuals                6374 228374   35.83
The ANOVA test shows that there is no significant difference in the average number of hours students study per week based on parental education levels, with a p-value greater than 0.05. This indicates that parental education levels do not have a significant impact on the number of hours students study per week.
Next, we explore the average number of hours students study per week based on the distance from home to school.
# Summary statistics for study hours by distance from home (Near, Moderate, Far)
summary(student_data$Hours_Studied[student_data$Distance_from_Home == "Near"])
Min. 1st Qu. Median Mean 3rd Qu. Max.
1.0 16.0 20.0 19.9 24.0 43.0
Min. 1st Qu. Median Mean 3rd Qu. Max.
2.00 16.00 20.00 20.05 24.00 44.00
Min. 1st Qu. Median Mean 3rd Qu. Max.
3.0 16.0 20.0 20.2 24.0 39.0
The summary statistics of weekly study hours are very similar across distances from home to school. We will now use ANOVA to test for differences in study hours based on the distance from home to school. But first, we need to check the assumptions of ANOVA.
# Sample the data
near_distance <- student_data$Hours_Studied[student_data$Distance_from_Home == "Near"][sample(sum(student_data$Distance_from_Home == "Near"), 100)]
moderate_distance <- student_data$Hours_Studied[student_data$Distance_from_Home == "Moderate"][sample(sum(student_data$Distance_from_Home == "Moderate"), 100)]
far_distance <- student_data$Hours_Studied[student_data$Distance_from_Home == "Far"][sample(sum(student_data$Distance_from_Home == "Far"), 100)]
# Histogram of study hours by distance from home
par(mfrow = c(1, 3))
hist(near_distance, main = "Near Distance from Home", xlab = "Study Hours", col = "skyblue", border = "black")
hist(moderate_distance, main = "Moderate Distance from Home", xlab = "Study Hours", col = "skyblue", border = "black")
hist(far_distance, main = "Far Distance from Home", xlab = "Study Hours", col = "skyblue", border = "black")
Shapiro-Wilk normality test
data: near_distance
W = 0.97892, p-value = 0.1094
Shapiro-Wilk normality test
data: moderate_distance
W = 0.99076, p-value = 0.7258
Shapiro-Wilk normality test
data: far_distance
W = 0.98969, p-value = 0.6399
All three histograms and the Shapiro-Wilk test show that the distribution of study hours is approximately normal for all students. We will now use ANOVA to test for differences in study hours based on the distance from home to school.
# ANOVA test for study hours by distance from home
anova_distance_from_home <- aov(Hours_Studied ~ Distance_from_Home, data = student_data)
summary(anova_distance_from_home)
                     Df Sum Sq Mean Sq F value Pr(>F)
Distance_from_Home    2     67   33.48   0.934  0.393
Residuals          6374 228344   35.82
The ANOVA test shows that there is no significant difference in the average number of hours students study per week based on the distance from home to school, with a p-value greater than 0.05. This indicates that the distance from home to school does not have a significant impact on the number of hours students study per week.
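A p-value alone does not say how small the differences are. The sketch below computes eta squared (the between-group sum of squares divided by the total sum of squares) from the rounded sums of squares printed in the ANOVA tables above. The data frame `anova_ss` and its column names are hypothetical and introduced only for illustration.

```r
# Sums of squares copied (as rounded in the printed output) from the ANOVA tables
anova_ss <- data.frame(
  factor_name = c("Family_Income", "Teacher_Quality", "Peer_Influence",
                  "Parental_Education_Level", "Distance_from_Home"),
  ss_between  = c(35, 12, 29, 37, 67),
  ss_residual = c(228376, 228400, 228383, 228374, 228344)
)

# Eta squared: share of the total variance in study hours explained by the factor
anova_ss$eta_squared <- anova_ss$ss_between /
  (anova_ss$ss_between + anova_ss$ss_residual)

print(anova_ss)
```

Every value is below 0.001, meaning each of these factors accounts for less than 0.1% of the variance in study hours, which is consistent with the non-significant F tests.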
Finally, we explore the average number of hours students study per week based on gender.
Min. 1st Qu. Median Mean 3rd Qu. Max.
1.00 16.00 20.00 19.94 24.00 39.00
Min. 1st Qu. Median Mean 3rd Qu. Max.
1.00 16.00 20.00 20.02 24.00 44.00
The summary statistics of weekly study hours are very similar for male and female students. We will now use ANOVA to test for differences in study hours based on gender. But first, we need to check the assumptions of ANOVA.
# Sample the data
male_data <- student_data$Hours_Studied[student_data$Gender == "Male"][sample(sum(student_data$Gender == "Male"), 100)]
female_data <- student_data$Hours_Studied[student_data$Gender == "Female"][sample(sum(student_data$Gender == "Female"), 100)]
# Histogram
par(mfrow = c(1, 2))
hist(male_data, main = "Male", xlab = "Study Hours", col = "skyblue", border = "black")
hist(female_data, main = "Female", xlab = "Study Hours", col = "skyblue", border = "black")
Shapiro-Wilk normality test
data: male_data
W = 0.99016, p-value = 0.6773
Shapiro-Wilk normality test
data: female_data
W = 0.9903, p-value = 0.6889
Both histograms and the Shapiro-Wilk test show that the distribution of study hours is approximately normal for both groups. We will now use ANOVA to test for differences in study hours based on gender.
            Df Sum Sq Mean Sq F value Pr(>F)
Gender       1     11   11.12    0.31  0.577
Residuals 6375 228400   35.83
The ANOVA test shows that there is no significant difference in the average number of hours students study based on gender.
In summary, the average number of hours students study per week is approximately 19.98 hours. With one exception, the tests show no significant differences in study hours across parental involvement, access to resources, extracurricular activities, motivation levels, family income, teacher quality, school type, peer influence, learning disabilities, parental education levels, distance from home to school, and gender. The exception is internet access, where the t-test on two 100-student samples gave a p-value of 0.0265; because the full-data group means are nearly identical (19.99 and 19.83 hours), this result may reflect sampling variation. Overall, the number of hours students study per week is broadly consistent across these factors.
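Because this section runs more than ten separate tests at the 0.05 level, an occasional p-value below 0.05 is expected by chance alone. The sketch below shows one possible way to run the tests in a loop on all rows and apply a Holm correction. It uses synthetic data with only three factors; `demo_data`, `factors_to_test`, and `results` are hypothetical names.

```r
# Minimal sketch on synthetic data in which no factor is related to study hours.
set.seed(123)
n <- 500
demo_data <- data.frame(
  Hours_Studied  = rnorm(n, mean = 20, sd = 6),
  Gender         = sample(c("Male", "Female"), n, replace = TRUE),
  School_Type    = sample(c("Public", "Private"), n, replace = TRUE),
  Peer_Influence = sample(c("Positive", "Neutral", "Negative"), n, replace = TRUE)
)

factors_to_test <- c("Gender", "School_Type", "Peer_Influence")

# One Kruskal-Wallis test per factor, run on all rows rather than a subsample
raw_p <- sapply(factors_to_test, function(f) {
  kruskal.test(demo_data$Hours_Studied, factor(demo_data[[f]]))$p.value
})

# The Holm adjustment controls the family-wise error rate across the tests
results <- data.frame(
  variable = factors_to_test,
  raw_p    = raw_p,
  holm_p   = p.adjust(raw_p, method = "holm")
)
print(results)
```

Reporting the adjusted p-values next to the raw ones makes it easier to judge whether an isolated significant result should be taken at face value.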
Now we analyze the variation in attendance rates across students.
# Sample the data
sample_attendance <- student_data$Attendance[sample(nrow(student_data), 1000)]
# Summary statistics for attendance rates
summary(sample_attendance)
Min. 1st Qu. Median Mean 3rd Qu. Max.
60.0 70.0 81.0 80.8 92.0 100.0
# Histogram of attendance rates
hist(sample_attendance, main = "Distribution of Attendance Rates", xlab = "Attendance Rate", col = "skyblue", border = "black")
From the histogram we can see that the distribution of attendance rates is approximately uniform, ranging from 60% to 100%, with one slightly higher peak around 70% to 80%. Attendance rates are therefore spread fairly evenly across students. The average attendance rate is approximately 80% (the sample mean above is 80.8%), which is relatively high. We now investigate the factors that may influence attendance rates.
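The visual impression of a roughly uniform distribution can be checked with a chi-squared goodness-of-fit test on binned counts. The sketch below is one way to do this, assuming attendance is recorded as whole percentages from 60 to 100. The vector `sample_attendance` here is a synthetic stand-in rather than the sample drawn above, and `binned` and `uniform_check` are hypothetical names.

```r
# Minimal sketch on synthetic integer attendance percentages drawn from 60..100.
set.seed(2024)
sample_attendance <- sample(60:100, 1000, replace = TRUE)

# Bin into four equal-width intervals [60,70), [70,80), [80,90), [90,100).
# The single value 100 is left out so every bin covers ten integer values.
binned <- cut(sample_attendance[sample_attendance < 100],
              breaks = seq(60, 100, by = 10), right = FALSE)

# With no `p` argument, chisq.test compares the counts with equal probabilities
uniform_check <- chisq.test(table(binned))
uniform_check
```

A large p-value would be consistent with a uniform distribution, while a small one would suggest that some attendance ranges are over-represented.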
We will now explore the attendance rates based on parental involvement levels.
# Summary statistics for attendance rates by parental involvement
summary(student_data$Attendance[student_data$Parental_Involvement == "High"])
Min. 1st Qu. Median Mean 3rd Qu. Max.
60.00 70.00 80.00 80.01 90.00 100.00
Min. 1st Qu. Median Mean 3rd Qu. Max.
60.0 70.0 80.0 79.9 90.0 100.0
Min. 1st Qu. Median Mean 3rd Qu. Max.
60.00 70.00 80.00 80.32 91.00 100.00
The summary statistics of attendance rates are very similar across parental involvement levels. We will now use ANOVA to test for differences in attendance rates based on parental involvement levels. But first, we need to check the assumptions of ANOVA.
# Sample the data
high_parental_involvement_attendance <- student_data$Attendance[student_data$Parental_Involvement == "High"][sample(sum(student_data$Parental_Involvement == "High"), 100)]
medium_parental_involvement_attendance <- student_data$Attendance[student_data$Parental_Involvement == "Medium"][sample(sum(student_data$Parental_Involvement == "Medium"), 100)]
low_parental_involvement_attendance <- student_data$Attendance[student_data$Parental_Involvement == "Low"][sample(sum(student_data$Parental_Involvement == "Low"), 100)]
# Histogram of attendance rates by parental involvement
par(mfrow = c(1, 3))
hist(high_parental_involvement_attendance, main = "High Parental Involvement", xlab = "Attendance Rate", col = "skyblue", border = "black")
hist(medium_parental_involvement_attendance, main = "Medium Parental Involvement", xlab = "Attendance Rate", col = "skyblue", border = "black")
hist(low_parental_involvement_attendance, main = "Low Parental Involvement", xlab = "Attendance Rate", col = "skyblue", border = "black")
Shapiro-Wilk normality test
data: high_parental_involvement_attendance
W = 0.9466, p-value = 0.0004986
Shapiro-Wilk normality test
data: medium_parental_involvement_attendance
W = 0.92294, p-value = 2.023e-05
Shapiro-Wilk normality test
data: low_parental_involvement_attendance
W = 0.928, p-value = 3.842e-05
All three histograms and the Shapiro-Wilk test show that the distribution of attendance rates is not normal in any of the groups. We will therefore use the Kruskal-Wallis test to test for differences in attendance rates based on parental involvement levels.
Kruskal-Wallis rank sum test
data: Attendance by Parental_Involvement
Kruskal-Wallis chi-squared = 1.2171, df = 2, p-value = 0.5441
The Kruskal-Wallis test shows that there is no significant difference in attendance rates based on parental involvement levels, with a p-value greater than 0.05. This indicates that parental involvement does not have a significant impact on attendance rates.
Next, we explore the attendance rates based on access to resources.
# Summary statistics for attendance rates by access to resources
summary(student_data$Attendance[student_data$Access_to_Resources == "High"])
Min. 1st Qu. Median Mean 3rd Qu. Max.
60.00 70.00 79.00 79.87 90.00 100.00
Min. 1st Qu. Median Mean 3rd Qu. Max.
60 70 80 80 90 100
Min. 1st Qu. Median Mean 3rd Qu. Max.
60.00 71.00 80.00 80.29 91.00 100.00
The summary statistics of attendance rates are very similar across levels of access to resources. We will now use ANOVA to test for differences in attendance rates based on access to resources. But first, we need to check the assumptions of ANOVA.
# Sample the data
high_access_to_resources_attendance <- student_data$Attendance[student_data$Access_to_Resources == "High"][sample(sum(student_data$Access_to_Resources == "High"), 100)]
medium_access_to_resources_attendance <- student_data$Attendance[student_data$Access_to_Resources == "Medium"][sample(sum(student_data$Access_to_Resources == "Medium"), 100)]
low_access_to_resources_attendance <- student_data$Attendance[student_data$Access_to_Resources == "Low"][sample(sum(student_data$Access_to_Resources == "Low"), 100)]
# Histogram of attendance rates by access to resources
par(mfrow = c(1, 3))
hist(high_access_to_resources_attendance, main = "High Access to Resources", xlab = "Attendance Rate", col = "skyblue", border = "black")
hist(medium_access_to_resources_attendance, main = "Medium Access to Resources", xlab = "Attendance Rate", col = "skyblue", border = "black")
hist(low_access_to_resources_attendance, main = "Low Access to Resources", xlab = "Attendance Rate", col = "skyblue", border = "black")
Shapiro-Wilk normality test
data: high_access_to_resources_attendance
W = 0.95112, p-value = 0.0009836
Shapiro-Wilk normality test
data: medium_access_to_resources_attendance
W = 0.94251, p-value = 0.0002752
Shapiro-Wilk normality test
data: low_access_to_resources_attendance
W = 0.93998, p-value = 0.0001923
All three histograms and the Shapiro-Wilk test show that the distribution of attendance rates is not normal in any of the groups. We will therefore use the Kruskal-Wallis test to test for differences in attendance rates based on access to resources.
Kruskal-Wallis rank sum test
data: Attendance by Access_to_Resources
Kruskal-Wallis chi-squared = 1.0494, df = 2, p-value = 0.5917
The Kruskal-Wallis test shows that there is no significant difference in attendance rates based on access to resources, with a p-value greater than 0.05. This indicates that access to resources does not have a significant impact on attendance rates.
Next, we explore the attendance rates based on participation in extracurricular activities.
# Summary statistics for attendance rates by extracurricular activities
summary(student_data$Attendance[student_data$Extracurricular_Activities == "Yes"])
Min. 1st Qu. Median Mean 3rd Qu. Max.
60 70 80 80 90 100
Min. 1st Qu. Median Mean 3rd Qu. Max.
60.00 70.00 80.00 80.05 90.00 100.00
The summary statistics of attendance rates are very similar for students who do and do not participate in extracurricular activities. We will now use ANOVA to test for differences in attendance rates based on participation in extracurricular activities. But first, we need to check the assumptions of ANOVA.
# Sample the data
participate_extracurricular_attendance <- student_data$Attendance[student_data$Extracurricular_Activities == "Yes"][sample(sum(student_data$Extracurricular_Activities == "Yes"), 100)]
do_not_participate_extracurricular_attendance <- student_data$Attendance[student_data$Extracurricular_Activities == "No"][sample(sum(student_data$Extracurricular_Activities == "No"), 100)]
# Histogram of attendance rates by extracurricular activities
par(mfrow = c(1, 2))
hist(participate_extracurricular_attendance, main = "Extracurricular Activities", xlab = "Attendance Rate", col = "skyblue", border = "black")
hist(do_not_participate_extracurricular_attendance, main = "No Extracurricular Activities", xlab = "Attendance Rate", col = "skyblue", border = "black")
Shapiro-Wilk normality test
data: participate_extracurricular_attendance
W = 0.96044, p-value = 0.004321
Shapiro-Wilk normality test
data: do_not_participate_extracurricular_attendance
W = 0.93609, p-value = 0.0001122
Both histograms and the Shapiro-Wilk test show that the distribution of attendance rates is not normal in either group. We will therefore use the Kruskal-Wallis test to test for differences in attendance rates based on participation in extracurricular activities.
Kruskal-Wallis rank sum test
data: Attendance by Extracurricular_Activities
Kruskal-Wallis chi-squared = 0.026414, df = 1, p-value = 0.8709
The Kruskal-Wallis test shows that there is no significant difference in attendance rates based on participation in extracurricular activities, with a p-value greater than 0.05. This indicates that participation in extracurricular activities does not have a significant impact on attendance rates.
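With only two groups, the Kruskal-Wallis test gives the same p-value as the Wilcoxon rank-sum test when the latter uses the normal approximation without continuity correction. The sketch below illustrates this on synthetic data; `demo_data`, `kw`, and `wx` are hypothetical names, not part of the original analysis.

```r
# Minimal sketch on synthetic data with two groups and many tied values.
set.seed(99)
demo_data <- data.frame(
  Attendance = sample(60:100, 400, replace = TRUE),
  Extracurricular_Activities = factor(sample(c("Yes", "No"), 400, replace = TRUE))
)

kw <- kruskal.test(Attendance ~ Extracurricular_Activities, data = demo_data)

# Normal approximation without continuity correction, to match Kruskal-Wallis
wx <- wilcox.test(Attendance ~ Extracurricular_Activities, data = demo_data,
                  exact = FALSE, correct = FALSE)

c(kruskal = kw$p.value, wilcoxon = wx$p.value)
```

Either test is therefore a reasonable choice for a two-level factor. The main point is to use the same one consistently across the report.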